pacman::p_load(jsonlite,dplyr,tidyr,stringr,lubridate,tidyverse,readtext,ggplot2,visNetwork,stringr,ggpubr, igraph, patchwork,igraph,ggraph,ggrepel)Take Home 3

Credit: John Blank Photography
1. Introduction
FishEye International monitors business records of commercial fishing operators in Oceanus to identify and prevent illegal fishing. Analysts work with company data, including ownership, shareholders, transactions, and products/services, to build the CatchNet Knowledge Graph.
Last year, SouthSeafood Express Corp was caught illegally fishing, leading to its closure. FishEye wants to understand the temporal patterns and infer how the fishing market reacted to this event, as some businesses may have tried to capture SouthSeafood’s market share, while others may have become more cautious about illegal activities.
FishEye aims to develop visualization tools for CatchNet to identify influential people in business networks, considering the varied and changing shareholder and ownership relationships.
2. Mini-Challenge 3: Temporal Analysis
2.1 Tasks and Questions:
Develop an approach using visual analytics to highlight temporal patterns and changes in corporate structures, focusing on identifying the most active individuals and businesses.
Utilize visualizations to display typical and atypical business transactions, such as mergers and acquisitions, and infer the underlying motivations behind changes in their activity levels.
Create a visual approach to examine inferences about how a company’s influence evolves over time and whether ownership or influence over a network can be deduced.
Identify and visualize the network of SouthSeafood Express Corp and competing businesses, and how they changed due to SouthSeafood’s illegal fishing behavior.
Determine which companies benefited from SouthSeafood’s legal troubles and if there are other suspicious transactions related to illegal fishing, providing visual evidence to support your conclusions.
Data Source: VAST Challenge 2024: Mini-Challenge 3
3. Data
The MC3 dataset is a comprehensive collection comprising 60,520 nodes (entities) and 75,817 edges (relationships or connections) organized into 4,782 distinct components. The nodes in this dataset represent various types of entities, including individuals (Person), chief executive officers (CEO), companies, and other organizational structures.
On the other hand, the edges capture different types of relationships or interactions between these nodes. Some examples of edge types include shareholdership (ownership of shares in a company), beneficial ownership (enjoying the benefits of owning a property or asset without being the legal owner), and potentially other forms of associations or transactions.
The dataset is structured such that the nodes key contains a list representing all the node entities, with each node carrying attributes or properties that describe its characteristics, such as ID, type, country, revenue, founding date, and potentially other relevant details.
Correspondingly, the links key holds a list that represents all the edges or connections between these nodes. Each edge entry typically includes properties like the edge type, start and end dates (if applicable), and identifiers for the source and target nodes involved in the relationship.
With this comprehensive dataset capturing both node entities and their interconnections through various types of edges, researchers and analysts can conduct in-depth analyses to uncover patterns, understand the dynamics of relationships, and gain insights into the complex network of entities and their interactions.
3.1 Data Preparations
- jsonlite: This package provides functionality for parsing and generating JSON data in R.
- dplyr: A part of the tidyverse, dplyr provides a consistent set of verbs for data manipulation, making it easier to transform and summarize data frames.
- tidyr: Another tidyverse package, tidyr helps in creating tidy data sets by providing functions for reshaping data frames.
- stringr: This package provides a cohesive set of functions for string manipulation and regular expressions in R.
- lubridate: lubridate is designed to make it easier to work with date-time data in R.
- tidyverse: The tidyverse is a collection of R packages designed for data science, including dplyr, ggplot2, tidyr, and others.
- readtext: readtext makes it easier to import and handle text data in R, particularly for text mining and analysis.
- ggplot2: Part of the tidyverse, ggplot2 is a powerful data visualization package for creating complex and publication-ready plots.
- visNetwork: visNetwork is a package for creating interactive network visualizations in R, using the vis.js library.
- ggpubr: ggpubr provides easy-to-use functions for creating publication-ready plots and combining multiple ggplot2 plots into a single figure.
- igraph: igraph is a collection of network analysis tools for creating and analyzing graphs and networks in R.
patchwork: patchwork is a package for composing multiple ggplot2 plots into a single figure, with easy layout control.
igraph: A package for creating and analysing graphs and networks.
ggraph: A package for creating graph-based data visualisations using the ‘ggplot2’ syntax.
ggrepel: A package for automatically adjusting text labels to avoid overlapping in ‘ggplot2’ visualisations.
Click to show code
Click to show code
mc3 <- fromJSON("data/mc3.json")To effectively perform temporal analysis on such data, it can be beneficial to separate the data set into two distinct data frames: one for the node entities and another for the relationships or edges between them.
Click to show code
# Load necessary libraries
library(dplyr)
library(tidyr)
library(stringr)
# Assuming mc3 is already loaded as a list containing nodes dataframe
mc3_nodes <- as_tibble(mc3$nodes) %>%
mutate(
id = as.character(id),
type = as.character(type),
country = as.character(country),
ProductServices = as.character(ProductServices),
revenue_omu = as.numeric(revenue),
head_of_org = as.character(HeadOfOrg),
TradeDescription = as.character(TradeDescription),
PointOfContact = as.character(PointOfContact),
id = gsub("^[0-9]+\\.\\s*", "", id), # Clean up IDs
revenue_omu = ifelse(is.na(revenue_omu), 0, revenue_omu)
) %>%
mutate(
type = ifelse(type == "Entity.Person", "Entity.Person.Person", type) # Rename Entity.Person
) %>%
rename(
last_edited_by = `_last_edited_by`,
last_edited_date = `_last_edited_date`,
date_added = `_date_added`,
raw_source = `_raw_source`,
algorithm = `_algorithm`
) %>%
select(everything())
# Display the first few rows of the updated dataframe
print(head(mc3_nodes))# A tibble: 6 × 17
type country ProductServices PointOfContact HeadOfOrg founding_date revenue
<chr> <chr> <chr> <chr> <chr> <chr> <dbl>
1 Entity… Uziland Unknown Rebecca Lewis Émilie-S… 1954-04-24T0… 5995.
2 Entity… Mawala… Furniture and … Michael Lopez Honoré L… 2009-06-12T0… 71767.
3 Entity… Uzifri… Food products Steven Robert… Jules La… 2029-12-15T0… 0
4 Entity… Islava… Unknown Anthony Wyatt Dr. Víct… 1972-02-16T0… 0
5 Entity… Oceanus Unknown Tracy Schmidt Jacques … 1954-04-06T0… 4747.
6 Entity… Imazam Fish, crustace… Corey Moore J… Thierry … 2031-09-30T0… 46567.
# ℹ 10 more variables: TradeDescription <chr>, last_edited_by <chr>,
# last_edited_date <chr>, date_added <chr>, raw_source <chr>,
# algorithm <chr>, id <chr>, dob <chr>, revenue_omu <dbl>, head_of_org <chr>
(mc3_nodes$type) %>% unique()[1] "Entity.Organization.Company"
[2] "Entity.Organization.LogisticsCompany"
[3] "Entity.Organization.FishingCompany"
[4] "Entity.Organization.FinancialCompany"
[5] "Entity.Organization.NewsCompany"
[6] "Entity.Organization.NGO"
[7] "Entity.Person.Person"
[8] "Entity.Person.CEO"
# Further processing to separate the 'type' column
mc3_nodes <- mc3_nodes %>%
separate(type, into = c("type_1", "type_2", "type_3"), sep = "\\.", fill = "right", extra = "drop")
# Display the first few rows of the updated dataframe
print(head(mc3_nodes))# A tibble: 6 × 19
type_1 type_2 type_3 country ProductServices PointOfContact HeadOfOrg
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Entity Organization Company Uziland Unknown Rebecca Lewis Émilie-S…
2 Entity Organization Company Mawalara Furniture and … Michael Lopez Honoré L…
3 Entity Organization Company Uzifrica Food products Steven Robert… Jules La…
4 Entity Organization Company Islavara… Unknown Anthony Wyatt Dr. Víct…
5 Entity Organization Company Oceanus Unknown Tracy Schmidt Jacques …
6 Entity Organization Company Imazam Fish, crustace… Corey Moore J… Thierry …
# ℹ 12 more variables: founding_date <chr>, revenue <dbl>,
# TradeDescription <chr>, last_edited_by <chr>, last_edited_date <chr>,
# date_added <chr>, raw_source <chr>, algorithm <chr>, id <chr>, dob <chr>,
# revenue_omu <dbl>, head_of_org <chr>
(mc3_nodes$type_2) %>% unique()[1] "Organization" "Person"
(mc3_nodes$type_3) %>% unique()[1] "Company" "LogisticsCompany" "FishingCompany" "FinancialCompany"
[5] "NewsCompany" "NGO" "Person" "CEO"
Click to show code
mc3_edges <- as_tibble(mc3$links) %>%
mutate(
type = as.character(type),
type_new = stringr::str_extract(type, "[^.]+$"),
source = as.character(source),
target = as.character(target),
last_edited_by = as.character(`_last_edited_by`),
last_edited_date = as.Date(`_last_edited_date`),
date_added = as.Date(`_date_added`),
raw_source = as.character(`_raw_source`),
algorithm = as.character(`_algorithm`)
) %>%
mutate(
start_date = as.Date(start_date),
end_date = if_else(is.na(end_date), as.Date(NA), as.Date(end_date)),
id = target
) %>%
# Remove `_last_edited_by`, `_last_edited_date`, `_date_added`, `_raw_source`, `_algorithm` if they exist
select(-c(`_last_edited_by`, `_last_edited_date`, `_date_added`, `_raw_source`, `_algorithm`)) %>%
rename(
last_edited_by = last_edited_by,
last_edited_date = last_edited_date,
date_added = date_added,
raw_source = raw_source,
algorithm = algorithm
) %>%
select(everything())Click to show code
mc3_edges <- mc3_edges %>%
mutate(
start_date = as.POSIXct(start_date),
last_edited_date = as.POSIXct(last_edited_date),
date_added = as.POSIXct(date_added),
end_date = if_else(is.na(end_date), as.POSIXct(NA), as.POSIXct(end_date)),
id = target
)
mc3_nodes <- mc3_nodes %>%
mutate(
last_edited_date = as.POSIXct(last_edited_date),
date_added = as.POSIXct(date_added),
founding_date = as.POSIXct(founding_date),
dob = as.POSIXct(dob)
)Click to show code
mc3_edges [duplicated(mc3_edges ),]# A tibble: 0 × 13
# ℹ 13 variables: start_date <dttm>, type <chr>, source <chr>, target <chr>,
# key <int>, end_date <dttm>, type_new <chr>, last_edited_by <chr>,
# last_edited_date <dttm>, date_added <dttm>, raw_source <chr>,
# algorithm <chr>, id <chr>
mc3_nodes [duplicated(mc3_nodes ),]# A tibble: 0 × 19
# ℹ 19 variables: type_1 <chr>, type_2 <chr>, type_3 <chr>, country <chr>,
# ProductServices <chr>, PointOfContact <chr>, HeadOfOrg <chr>,
# founding_date <dttm>, revenue <dbl>, TradeDescription <chr>,
# last_edited_by <chr>, last_edited_date <dttm>, date_added <dttm>,
# raw_source <chr>, algorithm <chr>, id <chr>, dob <dttm>, revenue_omu <dbl>,
# head_of_org <chr>
Click to show code
nodes <- mc3_nodes %>%
select(
type_2, type_3, id, dob, country,
head_of_org, revenue, last_edited_by,
last_edited_date, date_added, raw_source, algorithm
)
edges <- mc3_edges %>%
select(
type_new, id, start_date, end_date,
last_edited_by, last_edited_date, date_added,
raw_source, algorithm
)What are the Company Types?
# Count the unique values in type_2
type_2_unique_counts <- mc3_nodes %>%
group_by(type_2) %>%
summarise(count = n()) %>%
arrange(desc(count))
# Display the summary table
print(type_2_unique_counts)# A tibble: 2 × 2
type_2 count
<chr> <int>
1 Person 51649
2 Organization 8871
Extracting the top 10 id to see the different company types in edges and nodes file.
Click to show code
top_10_ids <- edges %>%
group_by(id) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
slice(1:10)Create a barchart for nodes and edges.
Click to show code
barchart_type_counts_nodes <- mc3_nodes %>%
count(type_3, sort = TRUE)
barchart_type_counts_edges <- mc3_edges %>%
count(type_new, sort = TRUE)
b1 <- ggplot(barchart_type_counts_nodes, aes(x = reorder(type_3, -n), y = n)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(
title = "Distribution of Nodes Types",
x = "Nodes Type",
y = "Count"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.8, size = 7)
)
b2 <- ggplot(barchart_type_counts_edges, aes(x = reorder(type_new, -n), y = n)) +
geom_bar(stat = "identity", fill = "lightblue") +
labs(
title = "Distribution of Edges Types",
x = "Edge Type",
y = "Count"
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5, size = 7)
)combined_plot <- b1 / b2
print(combined_plot)
Top companies and the Types
Network Graph
Click to show code
library(dplyr)
library(igraph)
library(visNetwork)
# Assuming mc3_edges is already loaded as a dataframe
# Get the top 20 companies based on the count of edges
top_companies <- mc3_edges %>%
count(id, sort = TRUE) %>%
top_n(20, wt = n)
# Filter edges to include only those involving the top 20 companies
filtered_edges <- mc3_edges %>%
filter(id %in% top_companies$id) %>%
select(id, type_new)
# Prepare edges for graph creation
edges_for_graph <- filtered_edges %>%
rename(from = id, to = type_new)
# Create a bipartite graph
bipartite_graph <- graph_from_data_frame(d = edges_for_graph, directed = FALSE)
# Assign types for bipartite mapping
V(bipartite_graph)$type <- bipartite_mapping(bipartite_graph)$type
# Prepare nodes for visualization
nodes_vis <- data.frame(
id = V(bipartite_graph)$name,
label = V(bipartite_graph)$name,
group = ifelse(V(bipartite_graph)$type, "Type", "Company")
)
# Prepare edges for visualization
edges_vis <- igraph::as_data_frame(bipartite_graph, what = "edges")
# Create and customize the visNetwork graph
vis1 <- visNetwork(nodes_vis, edges_vis, width = "100%", height = "800px") %>%
visNodes(shape = "dot", scaling = list(label = list(enabled = TRUE))) %>%
visEdges(arrows = "none", color = list(color = "lightgray", highlight = "red")) %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
visLegend() %>%
visGroups(groupname = "Type", shape = "dot", color = "blue") %>%
visGroups(groupname = "Company", shape = "dot", color = "green") %>%
visLayout(randomSeed = 21) %>%
visInteraction(navigationButtons = TRUE) %>%
visPhysics(stabilization = FALSE, enabled = FALSE)# Display the graph
vis1Companies and amount of types
Click to show code
top_10_ids <- mc3_edges %>%
group_by(id) %>%
summarize(count = n()) %>%
arrange(desc(count)) %>%
slice(1:10)
filtered_edges <- mc3_edges %>%
filter(id %in% top_10_ids$id)
type_new_counts <- filtered_edges %>%
group_by(id, type_new) %>%
summarise(count = n(), .groups = "drop") %>%
arrange(desc(count))
type_new_colors <- c("Shareholdership" = "pink",
"BeneficialOwnership" = "lightblue",
"WorksFor" = "lightgreen")
wrap_text <- function(text, width) {
wrapped_text <- str_replace_all(text, "-", " ")
wrapped_text <- str_wrap(wrapped_text, width = width)
wrapped_text <- str_replace_all(wrapped_text, " ", "-")
return(wrapped_text)
}
type_new_counts <- type_new_counts %>%
mutate(id_wrapped = wrap_text(id, 10))
type_new_counts <- type_new_counts %>%
mutate(id_wrapped = factor(id_wrapped, levels = type_new_counts %>%
group_by(id_wrapped) %>%
summarize(total_count = sum(count)) %>%
arrange(desc(total_count)) %>%
pull(id_wrapped)))
b4 <- ggplot(type_new_counts, aes(x = id_wrapped, y = count, fill = type_new)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = 0.8, size = 3) + # Add text labels
labs(x = " ", y = " ", title = "Top Names with Multiple Type Relationships") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8),
legend.position = "bottom"
) +
scale_fill_manual(values = type_new_colors, name = "Type")print(b4)
Which Companies has the highest Revenue?
Click to show code
library(plotly)
company_revenue <- mc3_nodes %>%
group_by(id, type_3) %>%
summarise(total_revenue = sum(revenue_omu, na.rm = TRUE), .groups = 'drop')
plot <- plot_ly(data = company_revenue,
x = ~id,
y = ~total_revenue,
text = ~paste("Name:", id,
"<br>Total Revenue:", total_revenue,
"<br>Type:", type_3),
type = "scatter",
mode = "markers",
marker = list(size = ~total_revenue / 1000000,
sizemode = 'area',
sizeref = 0.1,
color = ~total_revenue,
colorscale = "Viridis",
showscale = TRUE))
plot <- plot %>%
layout(title = "Interactive Bubble Chart of Company Revenue",
xaxis = list(title = "", showticklabels = FALSE), # Hide x-axis labels
yaxis = list(title = "Total Revenue"),
hovermode = "closest")plotIrregular Patterns by Revenue
Extracting year from full date as using the full date format is too detail for analysis.
edges <- edges %>%
mutate(
Start_Year = year(start_date),
End_Year = year(end_date)
)
nodes <- nodes %>%
mutate(added_year = year(date_added))The id with top 8 revenues had been extracted and visualise using heat map.
Click to show code
edges <- edges %>%
mutate(
Start_Year = year(start_date),
End_Year = year(end_date)
)
nodes <- nodes %>%
mutate(added_year = year(date_added))
top_8_revenue_nodes <- nodes %>%
arrange(desc(revenue)) %>%
slice_head(n = 8)
MC2_node_abnor <- edges %>%
filter(id %in% top_8_revenue_nodes$id) %>%
group_by(id, Start_Year) %>%
summarise(weight = n(), .groups = "drop")
MC2_node_abno <- edges %>%
filter(id %in% top_8_revenue_nodes$id) %>%
group_by(id, Start_Year) %>%
summarise(weight = n(), .groups = "drop")
g1 <- ggplot(MC2_node_abnor, aes(Start_Year, id)) +
geom_tile(aes(fill = weight)) +
geom_text(aes(label = weight), size = 3) +
labs(title = "Irregular Pattern by Revenue") +
scale_fill_gradient(low = "white", high = "lightblue") +
theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 8)) +
theme(axis.text.y = element_text(size = 6))print(g1)
Network Visualization with Node and Edge Groups
This graph visualises the relationships between various entities, highlighting different types of organisations and their connections.
Click to show code
mc3_nodes2 <- mc3_nodes %>%
mutate(id = as.character(id),
type_3 = as.character(type_3))
mc3_edges2 <- mc3_edges %>%
mutate(source = as.character(source),
target = as.character(target),
type_new = as.character(type_new))
edges2 <- mc3_edges2 %>%
select(from = source, to = target, type_new)
all_nodes <- unique(c(edges2$from, edges2$to))
nodes2 <- mc3_nodes2 %>%
filter(id %in% all_nodes) %>%
select(id, type_3) %>%
distinct() %>%
rename(name = id, group = type_3)
missing_nodes <- setdiff(all_nodes, nodes2$name)
if (length(missing_nodes) > 0) {
additional_nodes <- data.frame(name = missing_nodes, group = "Unknown")
nodes2 <- bind_rows(nodes2, additional_nodes)
}
nodes2 <- nodes2 %>%
distinct(name, .keep_all = TRUE)
graph <- graph_from_data_frame(d = edges2, vertices = nodes2, directed = TRUE)
nodes_vis <- data.frame(id = V(graph)$name, label = V(graph)$name, group = V(graph)$group)
edges_vis <- igraph::as_data_frame(graph, what = "edges")
vis1 <- visNetwork(
nodes_vis,
edges_vis,
width = "100%",
main = list(
text = "Network Visualization with Node and Edge Groups",
style = "font-size:17px; font-weight:bold; text-align:right;"
)
) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visGroups(groupname = "Unknown", color = "#D3D3D3") %>%
visGroups(groupname = "Company", color = "#1696d2") %>%
visLegend() %>%
visEdges(arrows = "to", color = list(color = "lightgray", highlight = "red")) %>%
visOptions(
highlightNearest = list(enabled = TRUE, degree = 2, hover = TRUE),
nodesIdSelection = TRUE,
selectedBy = "group",
collapse = TRUE
) %>%
visInteraction(navigationButtons = TRUE) %>%
visPhysics(stabilization = FALSE, enabled = FALSE)
unique_types <- unique(edges2$type_new)
for (t in unique_types) {
vis1 <- vis1 %>%
visGroups(groupname = t, color = scales::hue_pal()(length(unique_types))[which(unique_types == t)])
}vis1Click to show code
filtered_edges <- edges %>%
filter(!is.na(start_date) & !is.na(end_date))
filtered_edges <- filtered_edges %>%
mutate(year = year(start_date))
top_entities <- filtered_edges %>%
group_by(type_new) %>%
count(id) %>%
top_n(10, wt = n) %>%
ungroup() %>%
pull(id)
filtered_edges_top <- filtered_edges %>%
filter(id %in% top_entities)
nodes <- filtered_edges_top %>%
select(id, type_new) %>%
distinct() %>%
rename(name = id)
all_nodes <- unique(c(filtered_edges_top$id, filtered_edges_top$type_new))
missing_nodes <- setdiff(all_nodes, nodes$name)
if (length(missing_nodes) > 0) {
additional_nodes <- data.frame(name = missing_nodes, type_new = NA)
nodes <- bind_rows(nodes, additional_nodes)
}
nodes <- nodes %>%
distinct(name, .keep_all = TRUE)
node_shapes <- c(
"Fishing-related Companies" = 22,
"Unknown" = 21,
"Non-fishing Companies" = 23,
"Beneficial Owner" = 24,
"Ultimate Beneficial Owner" = 25,
"Multi-role Entity" = 20
)
node_colors <- c(
"Fishing-related Companies" = "blue",
"Unknown" = "grey",
"Non-fishing Companies" = "green",
"Beneficial Owner" = "yellow",
"Ultimate Beneficial Owner" = "red",
"Multi-role Entity" = "purple"
)
graph <- graph_from_data_frame(d = filtered_edges_top, vertices = nodes, directed = TRUE)
V(graph)$degree <- degree(graph, mode = "all")
layout <- create_layout(graph, layout = "fr")
g <- ggraph(layout) +
geom_edge_link(aes(color = factor(year)), alpha = 0.5, edge_width = 0.8) +
geom_node_point(aes(x = x, y = y, size = degree, shape = type_new, fill = type_new), color = "black", stroke = 0.5) +
scale_shape_manual(values = node_shapes) +
scale_fill_manual(values = node_colors) +
scale_color_manual(values = rainbow(length(unique(filtered_edges_top$year)))) + # Assign colors to years
geom_text_repel(aes(x = x, y = y, label = name), size = 2.5) +
theme_void() +
theme(legend.position = "right") +
labs(title = "Network Visualization of Top 30 Entities by Type and Year",
color = "Year",
fill = "Type")print(g)
Activity Counts per Year with Outliers Highlighted
Click to show code
filtered_edges <- edges %>%
filter(!is.na(start_date) & !is.na(end_date))
filtered_edges <- filtered_edges %>%
mutate(year = year(start_date))
activity_counts <- filtered_edges %>%
group_by(year, id, type_new) %>%
summarise(activity_count = n()) %>%
ungroup()
activity_counts <- activity_counts %>%
group_by(type_new) %>%
mutate(
Q1 = quantile(activity_count, 0.25),
Q3 = quantile(activity_count, 0.75),
IQR = Q3 - Q1,
lower_bound = Q1 - 1.5 * IQR,
upper_bound = Q3 + 1.5 * IQR,
is_outlier = activity_count < lower_bound | activity_count > upper_bound
) %>%
ungroup()
o1 <- ggplot(activity_counts, aes(x = year, y = activity_count, color = is_outlier)) +
geom_point() +
geom_text_repel(aes(label = ifelse(is_outlier, id, "")), size = 2.5) +
facet_wrap(~ type_new) +
labs(title = "Activity Counts per Year with Outliers Highlighted",
x = "Year", y = "Activity Count", color = "Outlier") +
theme_minimal()print(o1)
Click to show code
filtered_edges <- edges %>%
filter(!is.na(start_date) & !is.na(end_date))
filtered_edges <- filtered_edges %>%
mutate(year = year(start_date))
activity_counts <- filtered_edges %>%
group_by(year, id, type_new) %>%
summarise(activity_count = n()) %>%
ungroup()
activity_counts <- activity_counts %>%
group_by(type_new) %>%
mutate(
Q1 = quantile(activity_count, 0.25),
Q3 = quantile(activity_count, 0.75),
IQR = Q3 - Q1,
lower_bound = Q1 - 1.5 * IQR,
upper_bound = Q3 + 1.5 * IQR,
is_outlier = activity_count < lower_bound | activity_count > upper_bound
) %>%
ungroup()
ggplot(activity_counts, aes(x = year, y = activity_count, color = is_outlier)) +
geom_point() +
geom_text_repel(aes(label = ifelse(is_outlier, id, "")), size = 2.5) +
facet_wrap(~ type_new, scales = "free_y") + # Using free_y scale for better visualization
labs(title = "Activity Counts per Year with Outliers Highlighted",
x = "Year", y = "Activity Count", color = "Outlier") +
scale_color_manual(values = c("FALSE" = "blue", "TRUE" = "red")) + # Color coding outliers
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 14, face = "bold")
)
outliers_summary <- activity_counts %>%
filter(is_outlier) %>%
select(year, id, type_new, activity_count) %>%
arrange(type_new, year, desc(activity_count))print(outliers_summary)# A tibble: 31 × 4
year id type_new activity_count
<dbl> <chr> <chr> <int>
1 2014 Wright LLC BeneficialOwnership 2
2 2015 Wright LLC BeneficialOwnership 2
3 2017 Nichols-Esparza BeneficialOwnership 2
4 2022 Wright LLC BeneficialOwnership 2
5 2023 Nichols-Esparza BeneficialOwnership 2
6 2027 Smith-Ramirez BeneficialOwnership 4
7 2029 Smith-Ramirez BeneficialOwnership 2
8 2030 Smith-Ramirez BeneficialOwnership 2
9 2031 Smith-Ramirez BeneficialOwnership 5
10 2032 Stephens-Lopez BeneficialOwnership 7
# ℹ 21 more rows
ggplot(outliers_summary, aes(x = year, y = id, fill = activity_count)) +
geom_tile(color = "white") +
scale_fill_gradient(low = "lightblue", high = "red") +
facet_wrap(~ type_new, scales = "free_y") +
labs(title = "Heatmap of Outliers in Activity Counts",
x = "Year", y = "Entity", fill = "Activity Count") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(size = 14, face = "bold")
)